Resource on github document format here, on html format here
suppressPackageStartupMessages(library(tidyverse))
## Warning: package 'dplyr' was built under R version 3.4.2
knitr::opts_chunk$set(fig.width=12, fig.height=9)
library(knitr)
library(kableExtra)
options(knitr.table.format = "html")
library(readr)
library(forcats)
#install.packages("gdata")
library(gdata)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
##
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
##
## Attaching package: 'gdata'
## The following objects are masked from 'package:dplyr':
##
## combine, first, last
## The following object is masked from 'package:purrr':
##
## keep
## The following object is masked from 'package:stats':
##
## nobs
## The following object is masked from 'package:utils':
##
## object.size
## The following object is masked from 'package:base':
##
## startsWith
library(lattice)
# install.packages("RColorBrewer")
library(RColorBrewer)
#install.packages("tidygenomics") # Not used for now, but perhaps in another homework
#library(tidygenomics) # Not used for now, but perhaps in another homework
Vincenzo Coia has approved my request to use published genomic data for the next assignments, therefore, I want to provide a few clarifications:
Thibodeau, M. L. et al. Genomic profiling of pelvic genital type leiomyosarcoma in a woman with a germline CHEK2:c.1100delC mutation and a concomitant diagnosis of metastatic invasive ductal breast carcinoma. Cold Spring Harb Mol Case Stud mcs.a001628 (2017). doi:10.1101/mcs.a001628
Open access article and data available here
chek2_rna_cnv <- read.table("/Users/mylinh/Desktop/chek2-data-trial-stat545/chek2-rna-expression-cnv-data.txt", sep="\t", strip.white = TRUE, header = TRUE)
View(chek2_rna_cnv)
# class(chek2_rna_cnv)
# typeof(chek2_rna_cnv)
summary(chek2_rna_cnv) %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
|
|
|
|
|
|
|
| copy.change |
| avg.cna.by.gene | breakpoint | manually.curated.homd | loh | loh_ratio | intepreted.expression.status |
| SARC.percentile | SARC.kIQR | FC.mean.Bodymap | avg.TCGA.percentile | avg.TCGA.kIQR | avg.TCGA.norm.percentile | avg.TCGA.norm.kIQR | UCEC.norm.percentile | UCEC.norm.kIQR |
| |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 : 2042 | Min. : 5810 | Min. : 31427 | Min. :-1.00000 | p13.3 : 620 | ENSG00000000003: 1 | : 356 | : 34 | Min. :-2.00000 | Min. :-0.45000 | Min. :-0.47000 | Min. :0.00000 | Min. :0.00000 | : 92 | Min. :0.1800 | :18932 | Min. : 0.00 | Min. : 0.00 | Min. :-2.30 | Min. :-893.100 | Min. : 0.00 | Min. :-2.00 | Min. : 0.00 | Min. :-3.48 | Min. : 0.00 | Min. :-8.31 | :18335 | |
| 19 : 1421 | 1st Qu.: 31494320 | 1st Qu.: 31525315 | 1st Qu.:-1.00000 | q22.1 : 489 | ENSG00000000005: 1 | AGAP9 : 2 | Gain : 1092 | 1st Qu.: 0.00000 | 1st Qu.: 0.01000 | 1st Qu.: 0.00000 | 1st Qu.:0.00000 | 1st Qu.:0.00000 | DLOH: 1 | 1st Qu.:0.4700 | down: 460 | 1st Qu.: 0.23 | 1st Qu.: 27.00 | 1st Qu.:-0.37 | 1st Qu.: -1.480 | 1st Qu.: 23.00 | 1st Qu.:-0.40 | 1st Qu.: 22.00 | 1st Qu.:-0.42 | 1st Qu.: 8.00 | 1st Qu.:-0.83 | putative tumour suppressor : 667 | |
| 11 : 1281 | Median : 57837876 | Median : 57882616 | Median : 1.00000 | q13.2 : 433 | ENSG00000000419: 1 | CT45A5 : 2 | Homozygous Loss: 79 | Median : 0.00000 | Median : 0.01000 | Median : 0.02000 | Median :0.00000 | Median :0.00000 | HET :19503 | Median :0.5000 | up : 224 | Median : 3.17 | Median : 51.00 | Median : 0.08 | Median : -1.060 | Median : 47.00 | Median :-0.01 | Median : 48.00 | Median : 0.00 | Median : 42.00 | Median :-0.18 | oncogene : 303 | |
| 2 : 1224 | Mean : 74255388 | Mean : 74321328 | Mean : 0.01393 | p13.2 : 425 | ENSG00000000457: 1 | DCDC1 : 2 | Loss : 2147 | Mean :-0.05834 | Mean : 0.00147 | Mean : 0.00926 | Mean :0.00123 | Mean :0.00405 | HOMD: 14 | Mean :0.5027 | NA | Mean : 22.74 | Mean : 50.92 | Mean : Inf | Mean : -1.208 | Mean : 47.83 | Mean : Inf | Mean : 48.95 | Mean : Inf | Mean : 43.87 | Mean : Inf | tumour suppressor : 128 | |
| 17 : 1159 | 3rd Qu.:111340848 | 3rd Qu.:111383246 | 3rd Qu.: 1.00000 | q11.2 : 366 | ENSG00000000460: 1 | DIO3 : 2 | Neutral :16206 | 3rd Qu.: 0.00000 | 3rd Qu.: 0.02000 | 3rd Qu.: 0.04000 | 3rd Qu.:0.00000 | 3rd Qu.:0.00000 | NLOH: 6 | 3rd Qu.:0.5400 | NA | 3rd Qu.: 9.42 | 3rd Qu.: 76.00 | 3rd Qu.: 0.77 | 3rd Qu.: 1.200 | 3rd Qu.: 73.00 | 3rd Qu.: 0.66 | 3rd Qu.: 76.00 | 3rd Qu.: 0.73 | 3rd Qu.: 79.00 | 3rd Qu.: 0.84 | putative oncogene : 114 | |
| 3 : 1057 | Max. :249200395 | Max. :249214145 | Max. : 1.00000 | q21.3 : 348 | ENSG00000000938: 1 | DTX2 : 2 | No Data : 58 | Max. : 2.00000 | Max. : 0.42000 | Max. : 0.42000 | Max. :1.00000 | Max. :1.00000 | NA | Max. :0.9000 | NA | Max. :34198.14 | Max. :100.00 | Max. : Inf | Max. : 121.580 | Max. :100.00 | Max. : Inf | Max. :100.00 | Max. : Inf | Max. :100.00 | Max. : Inf | oncogene; putative tumour suppressor: 46 | |
| (Other):11432 | NA’s :92 | NA’s :92 | NA’s :92 | (Other):16935 | (Other) :19610 | (Other):19250 | NA | NA’s :92 | NA’s :92 | NA’s :92 | NA’s :92 | NA’s :92 | NA | NA’s :92 | NA | NA | NA’s :1347 | NA’s :2454 | NA | NA’s :760 | NA’s :1867 | NA’s :1347 | NA’s :2331 | NA’s :1347 | NA’s :2438 | (Other) : 23 |
dim(chek2_rna_cnv)
## [1] 19616 27
Note 1. I tried to use the functions from readr (read_table(), read_table2 and read_tsv()), but the column names with spaces (e.g. Ensembl gene ID) were not converted to column names with dot separation like it is done in read.table (E.g. Ensembl gene ID -> Ensembl.gene.ID.), which messed up the full data frame. I had error messages when trying to use dget() and dput. Therefore, why change a functional approach: I decided to keep read.table to do this homework.
Let’s clean up the data first:
chek2_rna_cnv <- with(chek2_rna_cnv, chek2_rna_cnv[!(chr==""),])
chek2_rna_cnv <- with(chek2_rna_cnv, chek2_rna_cnv[!(start==""),])
chek2_rna_cnv <- with(chek2_rna_cnv, chek2_rna_cnv[!(end==""),])
chek2_rna_cnv <- with(chek2_rna_cnv, chek2_rna_cnv[!(Ensembl.gene.ID==""),])
chek2_rna_cnv <- with(chek2_rna_cnv, chek2_rna_cnv[!(hugo==""),])
chek2_rna_cnv$cancer.gene.type <- as.character(chek2_rna_cnv$cancer.gene.type)
chek2_rna_cnv$cancer.gene.type[chek2_rna_cnv$cancer.gene.type==""] <- "NA"
chek2_rna_cnv$cancer.gene.type <- as.factor(chek2_rna_cnv$cancer.gene.type)
gene_types <- levels(chek2_rna_cnv$cancer.gene.type)
chek2_rna_cnv <- chek2_rna_cnv %>%
group_by(hugo) %>%
slice(1L)
dim(chek2_rna_cnv)
## [1] 19182 27
** Note 1.** You might have noticed that I manipulated the cancer.gene.type variable to change it to a character in order to replace the cancer.gene.type empty cells by “NA”, then converted back to a factor, which partially answers exercise (1A) (see below)
OBJECTIVES:
Note 1. I searched several online dictionaries, and I still do not understand the meaning of “principled”: the online dictionaries are giving me answers like “based on moral rules” and “(of a system or method) based on a given set of rules”, so I will conclude that if my approach is coherent and follow a logical path, it is principled.
Transform some of the variable of chek2_rna_cnv into factors. In cleaning up the dataset, we converted a factors column into characters then back to factors.
Let’s see what class of data we have in chek2_rna_cnv first.
sapply(chek2_rna_cnv, class) %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
| chr | factor |
| start | integer |
| end | integer |
| strand | integer |
| cytoband | factor |
| Ensembl.gene.ID | factor |
| hugo | factor |
| copy.category | factor |
| copy.change | integer |
| avg.cna | numeric |
| avg.cna.by.gene | numeric |
| breakpoint | integer |
| manually.curated.homd | integer |
| loh | factor |
| loh_ratio | numeric |
| intepreted.expression.status | factor |
| RPKM | numeric |
| SARC.percentile | integer |
| SARC.kIQR | numeric |
| FC.mean.Bodymap | numeric |
| avg.TCGA.percentile | integer |
| avg.TCGA.kIQR | numeric |
| avg.TCGA.norm.percentile | integer |
| avg.TCGA.norm.kIQR | numeric |
| UCEC.norm.percentile | integer |
| UCEC.norm.kIQR | numeric |
| cancer.gene.type | factor |
Change the avg.TCGA.percentile (class integer) to factors with base R:
d0 <- chek2_rna_cnv
d0$avg.TCGA.percentile <- as.factor(d0$avg.TCGA.percentile)
d0 %>%
select(hugo,avg.TCGA.percentile) %>%
head(10) %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
| hugo | avg.TCGA.percentile |
|---|---|
| A1BG | 85 |
| A1CF | 32 |
| A2M | 90 |
| A2ML1 | 70 |
| A4GALT | 62 |
| A4GNT | 43 |
| AAAS | 11 |
| AACS | 3 |
| AADAC | 75 |
| AADACL2 | 77 |
nlevels(d0$avg.TCGA.percentile)
## [1] 101
Change the avg.TCGA.percentile (class integer) to factors with forcats:
d0$avg.TCGA.percentile <- as_factor(d0$avg.TCGA.percentile)
d0 %>%
select(hugo,avg.TCGA.percentile) %>%
head(10) %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
| hugo | avg.TCGA.percentile |
|---|---|
| A1BG | 85 |
| A1CF | 32 |
| A2M | 90 |
| A2ML1 | 70 |
| A4GALT | 62 |
| A4GNT | 43 |
| AAAS | 11 |
| AACS | 3 |
| AADAC | 75 |
| AADACL2 | 77 |
In my case, I obtain the same result with base R and forcats with the avg.TCGA.percentile as factors, although I know that base R has a tendency to try re-ordering the data, which is not alway what we want, but the code syntax I used did not lead to this problem for me.
Filter chek2_rna_cnv to remove all the avg.TCGA.percentile equal to 0 and drop unused levels.
So we are starting with 101 levels (0 to 100, inclusively)
nlevels(d0$avg.TCGA.percentile)
## [1] 101
d1 <- d0 %>%
filter(avg.TCGA.percentile != 0) %>%
arrange(avg.TCGA.percentile) %>%
select(hugo, avg.TCGA.percentile)
nlevels(d1$avg.TCGA.percentile)
## [1] 101
d1 <- data.frame(d1)
d1 <- gdata::drop.levels(d1)
# length(levels(d1$avg.TCGA.percentile)) # alternative method to count levels
nlevels(d1$avg.TCGA.percentile)
## [1] 100
Now let’s see how we can do this with base R.
d1_v2 <- d0 %>%
filter(avg.TCGA.percentile != 0) %>%
arrange(avg.TCGA.percentile) %>%
select(hugo, avg.TCGA.percentile)
class(d1_v2)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
d1_v2 <- ungroup(d1_v2)
class(d1_v2)
## [1] "tbl_df" "tbl" "data.frame"
nlevels(d1_v2$avg.TCGA.percentile)
## [1] 101
d1_v2 <- d1_v2 %>%
droplevels()
nlevels(d1_v2$avg.TCGA.percentile)
## [1] 100
And now let’s do the same with forcats.
d1_v3 <- d0 %>%
filter(avg.TCGA.percentile != 0) %>%
arrange(avg.TCGA.percentile) %>%
select(hugo, avg.TCGA.percentile)
nlevels(d1_v3$avg.TCGA.percentile)
## [1] 101
d1_v3 <- ungroup(d1_v3)
d1_v3$avg.TCGA.percentile %>%
fct_drop() %>%
nlevels()
## [1] 100
Note 1. In the example above, you can see that the number of levels is initially 101 (length function) and after removing the avg.TCGA.percentile equal to 0, the number of levels is 100, so I successfully dropped the unused level “0”.
Note 2. To do this, I used a function of the gdata package, as exemplified here
Note 3. I had trouble with further manipulations on the data.frame because its class was grouped (“grouped_df” “tbl_df” “tbl” “data.frame”) so I had to use the ungroup() function as shown here.
Use the forcats package to change the order of the factor levels, based on a principled summary of one of the quantitative variables. Consider experimenting with a summary statistic beyond the most basic choice of the median.
I will plot the RPKM value according to each hugo gene.
d2 <- chek2_rna_cnv %>%
filter(avg.TCGA.percentile != 0) %>%
arrange(avg.TCGA.percentile)
d2 <- ungroup(d2)
p1 <- d2 %>%
group_by(hugo) %>%
arrange(RPKM) %>%
ggplot(aes(x=hugo, y=RPKM))
p1 + geom_point(aes(colour=cancer.gene.type), alpha = 0.6) +
theme(text = element_text(size=12), axis.title.x=element_blank(), axis.text.x = element_blank(), axis.ticks.x=element_blank())
Note 1. This is not very pretty, I would like to re-order the hugo gene according to their increasing value of RPKM.
Note 2. Using the arrange() function did not resolve this problem because the factors “hugo” are not re-arranged according to RPKM. We need additional functions to do so (see below for examples).
I will arrange the order of the factor hugo according to the ascending value of RPKM, as previously exemplified in my homework 3.
d2$hugo <- factor(d2$hugo, levels = d2$hugo[order(d2$RPKM)])
p1 <- d2 %>%
group_by(hugo) %>%
arrange(RPKM) %>%
filter(cancer.gene.type != "NA") %>%
ggplot(aes(x=hugo, y=RPKM))
p1 + geom_point(aes(colour=cancer.gene.type), alpha = 0.6) +
theme(text = element_text(size=12), axis.title.x=element_blank(), axis.text.x = element_blank(), axis.ticks.x=element_blank())
I will start by changing the lcass of avg.TCGA.percentile back to a numeric so that I can do a mean function on it. Then, I will show the difference between ordered and un-ordered factors cancer.gene.type according to the mean.
chek2_rna_cnv$avg.TCGA.percentile <- as.numeric(chek2_rna_cnv$avg.TCGA.percentile)
chek2_rna_cnv <- ungroup(chek2_rna_cnv)
d3 <- chek2_rna_cnv %>%
filter(avg.TCGA.percentile != 0) %>%
arrange(avg.TCGA.percentile)
d3.mean.percentile.by.type <- d3 %>%
group_by(cancer.gene.type) %>%
dplyr::summarize(mean.percentile.by.type = mean(avg.TCGA.percentile))
d3.mean.percentile.by.type %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
| cancer.gene.type | mean.percentile.by.type |
|---|---|
| NA | 47.96121 |
| oncogene | 49.04714 |
| oncogene; putative tumour suppressor | 41.17778 |
| oncogene; tumour suppressor | 40.66667 |
| putative oncogene | 44.84821 |
| putative oncogene; putative tumour suppressor | 46.14286 |
| putative oncogene; tumour suppressor | 21.33333 |
| putative tumour suppressor | 50.62422 |
| tumour suppressor | 42.50781 |
# cancer.gene.type factors in alphabetical order
p3 <- d3.mean.percentile.by.type %>%
ggplot(aes(x=cancer.gene.type, y = mean.percentile.by.type))
p3 + geom_point() + theme(text = element_text(size=12), axis.text.x = element_text(angle=45, hjust=1))
# cancer.gene.type factors ordered according to the mean.percentile.by.type
p3 <- d3.mean.percentile.by.type %>%
ggplot(aes(x=fct_reorder(cancer.gene.type, mean.percentile.by.type), y = mean.percentile.by.type))
p3 + geom_point() + theme(text = element_text(size=12), axis.text.x = element_text(angle=45, hjust=1))
We can also order the data.frame factors according to their frequency.
levels(d3$cancer.gene.type) %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
| NA |
| oncogene |
| oncogene; putative tumour suppressor |
| oncogene; tumour suppressor |
| putative oncogene |
| putative oncogene; putative tumour suppressor |
| putative oncogene; tumour suppressor |
| putative tumour suppressor |
| tumour suppressor |
d3$cancer.gene.type %>% forcats::fct_infreq() %>% levels() %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
| NA |
| putative tumour suppressor |
| oncogene |
| tumour suppressor |
| putative oncogene |
| oncogene; putative tumour suppressor |
| oncogene; tumour suppressor |
| putative oncogene; putative tumour suppressor |
| putative oncogene; tumour suppressor |
Note 1. So the most frequent factor is “NA”, followed by “putative tumour suppressor”, etc.
We can also re-order the factors avg.TCGA.percentile according to their frequency:
d3$avg.TCGA.percentile <- factor(d3$avg.TCGA.percentile)
levels(d3$avg.TCGA.percentile) %>% head(5) %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
d3$avg.TCGA.percentile %>% forcats::fct_infreq() %>% levels() %>% head(5) %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
| 49 |
| 2 |
| 1 |
| 3 |
| 48 |
Note 1. The most frequent avg.TCGA.percentile factor is 49, therefore, we know that there is a high number of hugo genes with a gene expression level at the 49th percentile.
PLOT EXAMPLE
We will plot the avg.TCGA.percentile factors in un-ordered way, then ordered by frequency. We will only use the 2000 first data entries.
d2$avg.TCGA.percentile <- factor(d2$avg.TCGA.percentile)
p4 <- d2 %>%
group_by(cancer.gene.type) %>%
head(2000) %>%
ggplot(aes(x=avg.TCGA.percentile))
p4 + geom_bar(aes(fill=cancer.gene.type))
p4 <- d2 %>%
group_by(cancer.gene.type) %>%
head(2000) %>%
ggplot(aes(x=fct_infreq(avg.TCGA.percentile)))
p4 + geom_bar(aes(fill=cancer.gene.type))
Note 1. The reason why the factors are not the same than illustrated in the previous example is because I only selected the first 2000 data entries and the factors were already ordered in ascending fashion in the d2 dataset.
Resources:
Answer: As exemplified above, arrange() is not sufficient to re-order data according to a factor. For example, the hugo genes were not re-ordered according to the RPKM even after using arrange(RPKM), and the visual output is not very intuitine. We need to use functions of base R or forcats to do this (see examples above).
Answer: As mentioned previously, factors have to be treated a different way. When we are ploting numerical values according to numerical values, we can use arrange to re-order the data, but when we use factors, it requires extra-steps in order to go around “R who is trying to re-order your factors data for you (without your permission)”.
Answer: Please refer above for several tables, ways of counting and dropping levels and plots.
Resources: tidyverse and readr
In all honesty, I spent quite a bit of time on visual aspects of ggplot graph in in my homework 3 since I needed it for research-related problems at that time as well.
Therefore, I will try something different here: making a color scheme for the hugo genes based on the cancer.gene.type category they belong to. The entire code below is based on the work of Jenny Bryan, specifically tutorial on ggplot2 here (create color scheme for data) and here (plot the data).
First, the values of cancer.gene.type (stored in gene_types) are too long, let’s make abbreviations for them, and replace the cancer.gene.type by the abbreviation in d5.
abbreviations_gene_types <- factor(c("unknown", "ONC", "ONC.pTS", "ONC.TS", "pONC", "pONC.pTS", "pONC.TS", "pTS", "TS"))
abbreviations_gene_types <- as.character(abbreviations_gene_types)
abbr_table <- data.frame(Abbreviation = abbreviations_gene_types, cancer.gene.type = gene_types)
abbr_table %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
| Abbreviation | cancer.gene.type |
|---|---|
| unknown | NA |
| ONC | oncogene |
| ONC.pTS | oncogene; putative tumour suppressor |
| ONC.TS | oncogene; tumour suppressor |
| pONC | putative oncogene |
| pONC.pTS | putative oncogene; putative tumour suppressor |
| pONC.TS | putative oncogene; tumour suppressor |
| pTS | putative tumour suppressor |
| TS | tumour suppressor |
d5 <- d0
d5$cancer.gene.type <- as.character(d5$cancer.gene.type)
d5$cancer.gene.type <- gsub("NA", "unknown", d5$cancer.gene.type)
d5$cancer.gene.type <- gsub("oncogene", "ONC", d5$cancer.gene.type)
d5$cancer.gene.type <- gsub("tumour suppressor", "TS", d5$cancer.gene.type)
d5$cancer.gene.type <- gsub("putative", "p", d5$cancer.gene.type)
d5$cancer.gene.type <- as.vector(d5$cancer.gene.type)
d5$cancer.gene.type <- gsub("; ", ".", d5$cancer.gene.type)
d5$cancer.gene.type <- gsub(" ", "", d5$cancer.gene.type)
cancer.gene.type-level info
d5 <- d5 %>%
filter(cancer.gene.type != "unknown")
dim(d5)
## [1] 1281 27
d5.cancer.gene.type <- aggregate(hugo~cancer.gene.type, d5, function(x) length(unique(x)))
n.cancer.gene.type <- nrow(d5.cancer.gene.type)
n.cancer.gene.type
## [1] 8
d5.cancer.gene.type %>% kable(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), position = "center")
| cancer.gene.type | hugo |
|---|---|
| ONC | 303 |
| ONC.pTS | 46 |
| ONC.TS | 12 |
| pONC | 114 |
| pONC.pTS | 8 |
| pONC.TS | 3 |
| pTS | 667 |
| TS | 128 |
Note 1. I realized later on that it was not feasible to keep the unknown values, there were too many, so I had to filter them out.
Map cancer.gene.type into colours using the RColorBrewer package
display.brewer.all(type = "div")
color_anchors <-
list(
# unknown = brewer.pal(n=3, 'RdBu')[7:9], # Removed, causing issues
ONC = brewer.pal(n=10, 'RdYlBu')[1:4], #
ONC.pTS = brewer.pal(n=10, 'PuOr')[2:5], #
ONC.TS = brewer.pal(n=10, 'RdYlBu')[5:8], #
pONC = brewer.pal(n=10, 'PiYG')[1:4], #
pONC.pTS=brewer.pal(n=3,'BrBG')[3:4], #
pONC.TS = brewer.pal(n=3, 'PRGn')[3:4], #
pTS = brewer.pal(n=10, 'RdBu')[7:11], #
TS= brewer.pal(n=10, 'PiYG')[7:11])
Select the first or darkest color to represent the whole cancer.gene.type category
d5.cancer.gene.type$color <- lapply(color_anchors, "[",1 )
Expand into palettes big enough to cover each gene in a continent
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
d5 <- data.frame(d5)
hugo_colors <- plyr::ddply(d5, ~cancer.gene.type, function(x) {
the.cancer.gene.type <- x$cancer.gene.type[1]
x <- droplevels(x)
hugo.lowtohigh <-
with(x, levels(reorder(hugo, RPKM)))
colorFun <- colorRampPalette(color_anchors[[the.cancer.gene.type]])
return(data.frame(hugo = I(hugo.lowtohigh),
color = I(colorFun(length(hugo.lowtohigh)))))
})
## fiddly parameters that control printing of hugo names
charLimit <- 12
xFudge <- 0.05
jCex <- 0.75
Note 1. ddply() is a function from the plyr package, as mentioned here
We can then store the figure making code as a function so can make pdf and png.
make_figure <- function() {
plot(c(0, n.cancer.gene.type), c(0,1), type = "n",
xlab = "", ylab="", xaxt = "n", yaxt = "n", bty = "n",
main="CHEK2 data - Cancer Gene Type - Color Scheme")
for(i in seq_len(n.cancer.gene.type)) {
this.cancer.gene.type <- d5.cancer.gene.type$cancer.gene.type[i]
nCols <- with(d5.cancer.gene.type, hugo[cancer.gene.type==this.cancer.gene.type])
yFudge = 0.1/nCols
foo <- seq(from = 0, to =1, length.out = nCols+1)
rect(xleft = i-1,
ybottom = foo[-(nCols + 1)],
xright = i,
ytop = foo[-1],
col=with(hugo_colors, color[cancer.gene.type==this.cancer.gene.type]))
text(x = i - 1 + xFudge,
y = foo[-(nCols + 1)] + yFudge,
labels = with(hugo_colors,
substr(hugo[cancer.gene.type==this.cancer.gene.type], 1, charLimit)),
adj = c(0, 0), cex = jCex)
}
mtext(d5.cancer.gene.type$cancer.gene.type, side = 3, at = seq_len(n.cancer.gene.type) - 0.5)
mtext(c("highest.RPKM", "smallest.RPKM"),
side = 2, at = c(0.9, 0.1), las = 1)
}
op <- par(mar = c(1, 4, 6, 1) + 0.1)
make_figure()
par(op)
png("chek2-cancer.gene.type-colors.png",
width = 30, height = 70, units = "in", res = 200)
op <- par(mar = c(1, 4, 6, 1) + 0.1)
make_figure()
dev.off()
## quartz_off_screen
## 2
par(op)
pdf("chek2-cancer.gene.type-colors.pdf",
width = 30, height = 70)
op <- par(mar = c(1, 4, 6, 1) + 0.1)
make_figure()
dev.off()
## quartz_off_screen
## 2
par(op)
write.table(hugo_colors, "scratch-space2/chek2-hugo-colors.tsv",
quote = FALSE, sep = "\t", row.names = FALSE)
d5.cancer.gene.type$cancer.gene.type <- as.character(d5.cancer.gene.type$cancer.gene.type)
d5.cancer.gene.type$hugo <- as.character(d5.cancer.gene.type$hugo)
d5.cancer.gene.type$color <- as.character(d5.cancer.gene.type$color)
write.table(d5.cancer.gene.type, "scratch-space2/chek2-cancer.gene.type-colors.tsv",
quote = FALSE, sep = "\t", row.names = FALSE)
Note 1. I had to convert the d5.cancer.gene.type table to characters before being able to write to a tsv file, as mentioned in this stack overflow discussion here.
Let’s write a figure to a file, then load it back to embed it in this report.
Merge color into data, as shown here
library(plyr)
hugo_colors_v2 <- read.delim("scratch-space2/chek2-hugo-colors.tsv", colClasses = list(color = "character"), col.names = c("cancer.gene.type", "hugo", "hugo_color"))
## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : not all columns named in 'colClasses' exist
str(hugo_colors_v2)
## 'data.frame': 1281 obs. of 3 variables:
## $ cancer.gene.type: Factor w/ 8 levels "ONC","ONC.pTS",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ hugo : Factor w/ 1281 levels "ABI1","ABI2",..: 194 396 399 400 530 740 742 846 850 1116 ...
## $ hugo_color : Factor w/ 1025 levels "#053061","#053062",..: 555 555 555 557 557 562 562 565 565 567 ...
#insert color as a variable into d5
d5 <- merge(d5, hugo_colors_v2)
Then we can use this to make plots, as Jenny Bryan showed here
color_scheme_hugo <- d5 %>%
select(hugo, hugo_color)
head(color_scheme_hugo)
## hugo hugo_color
## 1 ABI1 #F67B49
## 2 ABI2 #0E4179
## 3 ABL1 #F78950
## 4 ABL2 #D93529
## 5 ACACA #C51B7D
## 6 ACAD8 #B2126E
p5 <- d5 %>%
filter(cancer.gene.type != "unknown") %>%
group_by(cancer.gene.type) %>%
ggplot(aes(x=log2(RPKM), y=FC.mean.Bodymap), na.rm=TRUE)
p5 + geom_point(aes(shape=copy.category)) +
scale_size_continuous(range=c(1,40)) +
facet_wrap(~cancer.gene.type) +
aes(colour= hugo_color) + scale_colour_identity() +
theme_bw() + theme(strip.text = element_text(size = rel(1.1)))
ggsave("scratch-space2/chek2_cancer_gene_type_color_plot.png", width = 20, height = 20, units = "cm", dpi = 300)
Note 1. In the plot above, one might want to focus on the genes that have low expression (low RPKM) and copy loss, or high expression and copy gain, as these paired-values are more likely to be biologically relevant.
(plot removed for html format)
If you are generating two plots in the same code chunk, you need to specify which plot to save with ggsave. For example, let’s take the same plots previously shown.
p4 <- d2 %>%
group_by(cancer.gene.type) %>%
head(2000) %>%
ggplot(aes(x=avg.TCGA.percentile))
p4_unordered <- p4 + geom_bar(aes(fill=cancer.gene.type))
p4 <- d2 %>%
group_by(cancer.gene.type) %>%
head(2000) %>%
ggplot(aes(x=fct_infreq(avg.TCGA.percentile)))
p4 + geom_bar(aes(fill=cancer.gene.type))
p4_ordered <- p4 + geom_bar(aes(fill=cancer.gene.type))
ggsave("scratch-space2/p4_ordered.png", width = 20, height = 20, units = "cm", dpi = 300)
By default, ggsave use the last stored plot to save to the file, so if you want to save the first plot, you have to specify which plot you want to save with the argument plot = p4_unordered in our case.
p4 <- d2 %>%
group_by(cancer.gene.type) %>%
head(2000) %>%
ggplot(aes(x=avg.TCGA.percentile))
p4_unordered <- p4 + geom_bar(aes(fill=cancer.gene.type))
p4 <- d2 %>%
group_by(cancer.gene.type) %>%
head(2000) %>%
ggplot(aes(x=fct_infreq(avg.TCGA.percentile)))
p4 + geom_bar(aes(fill=cancer.gene.type))
p4_ordered <- p4 + geom_bar(aes(fill=cancer.gene.type))
ggsave("scratch-space2/p4_unordered.png", plot=p4_unordered, width = 20, height = 20, units = "cm", dpi = 300)
Resources:
I have used some strategies to keep my repository clean and organized: